
GitLab Runners Setup

We use GitLab Runners to run our CI/CD pipelines. GitLab provides a managed runner service, but we manage our own fleet of runners to have the level of performance we need.

Additionally, we were already exceeding the 10k minutes per month limit of GitLab's managed runners and paying ~60 USD per month for an extra 7k minutes. That is approximately the cost of our new runner fleet, without taking into account the additional cost of the managed runners and the need to manage them.

Managing our own runners offers these benefits:

  • We can control the specifications of the server where the runner is executed
  • We can tune performance settings globally for all pipeline jobs
  • We can take advantage of Docker image caching to speed up each job's boot process
  • We have full control over the infrastructure and can scale it as needed

However, it's important to note that we are fully responsible for:

  • Maintaining and updating the infrastructure
  • Monitoring system health and performance
  • Troubleshooting any issues that arise
  • Ensuring security and reliability
  • Managing costs and resource utilization

Setting up a Fleet of GitLab Runners

This guide provides step-by-step instructions to set up multiple GitLab runners on Hetzner Cloud using an autoscaled Docker executor, managed from a central Hetzner instance.

We use Hetzner because of its high performance and relatively low cost.

Prerequisites

  • Hetzner Cloud account
  • GitLab.com group owner access
  • Basic understanding of Docker and Linux administration

Architecture

                    ┌──────────────┐
                    │  GitLab.com  │
                    └──────┬───────┘
                           │
                 ┌─────────┴─────────┐
                 │  Runners Manager  │
                 │      (cpx11)      │
                 │   Hetzner nbg1    │
                 └─────────┬─────────┘
                           │
       ┌───────────────────┼───────────────────┐
       │                   │                   │
┌──────┴──────┐     ┌──────┴──────┐     ┌──────┴──────┐
│  Runner 1   │     │  Runner 2   │     │  Runner N   │
│   (ccx53)   │     │   (ccx53)   │● ● ●│   (ccx53)   │
│ Hetzner fsn1│     │ Hetzner fsn1│     │ Hetzner fsn1│
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │
       └───────────────────┼───────────────────┘
                           │
           ┌───────────────┴───────────────┐
           │      Object Store Cache       │
           └───────────────────────────────┘

Components Overview

  1. Runners Manager (cpx11)

    • Lightweight instance that orchestrates the runner fleet
    • Handles runner registration and configuration
    • Manages autoscaling based on pipeline demand
  2. Runner Instances (ccx53/ccx43/ccx33/ccx23/cpx51)

    • Powerful instances that execute the actual CI/CD jobs
    • Autoscaled based on demand
    • Multiple fallback server types ensure high availability
  3. Object Store Cache

    • S3-compatible storage for caching dependencies
    • Speeds up builds by reusing previously downloaded packages
    • Shared across all runners
  4. Security

    • Runner manager firewall rules
    • Secure communication between components
    • Isolated runner environments

Set Up Steps

1. Create a new group runner on GitLab.com

  1. Navigate to https://gitlab.com/groups/publicala/-/runners
  2. Click "New group runner"
  3. Configure the runner:
    • Check "Run untagged jobs"
    • Set "Maximum job timeout" to 900 seconds (15 minutes)
    • Click "Create runner"
    • Ensure "Operating systems" is set to Linux
  4. Copy the "runner authentication token" (looks like "glrt-t2_RPUmZza3qmYAWyMT9446")
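
Before pasting the token into config.toml, a quick format check can catch copy mistakes. The sketch below assumes that runner authentication tokens start with the glrt- prefix, as in the example above; the function name and the placeholder value are hypothetical.

```shell
# Hypothetical helper: succeeds if the value looks like a runner authentication token.
# Assumption: tokens created through "New group runner" start with "glrt-".
is_runner_auth_token() {
  case "$1" in
    glrt-?*) return 0 ;;
    *)       return 1 ;;
  esac
}

# Usage: replace the placeholder with the token copied from GitLab.
if is_runner_auth_token "glrt-EXAMPLE"; then
  echo "token format looks right"
else
  echo "unexpected token format" >&2
fi
```

This only checks the shape of the string; whether the token is actually valid is decided by GitLab when the runner connects.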

2. Manager Instance Setup

  1. Create a new instance in Hetzner Cloud:

    • Select Ubuntu 24.04 LTS
    • Choose cpx11 instance type
    • Select nbg1 datacenter
    • Add the SSH key stored in 1Password (Hetzner - gitlab-runners - hetzner-gitlab-runners)
  2. Install Required Software:

# SSH into the manager
ssh root@<MANAGER_IP>

# Install GitLab Runner (latest version)
sudo curl -L --output /usr/local/bin/gitlab-runner https://gitlab-runner-downloads.s3.amazonaws.com/latest/binaries/gitlab-runner-linux-amd64
sudo chmod +x /usr/local/bin/gitlab-runner

# Create gitlab-runner user
sudo useradd --comment 'GitLab Runner' --create-home gitlab-runner --shell /bin/bash

# Install the service
sudo gitlab-runner install --user=gitlab-runner --working-directory=/home/gitlab-runner

# Verify empty config
cat /etc/gitlab-runner/config.toml
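
After the installation, it can help to confirm that the tools used in the rest of this guide are on the PATH. This is a minimal sketch; the helper name is hypothetical and the list of commands is only a suggestion.

```shell
# Hypothetical helper: report whether a command is available on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Check the binary installed above, plus systemctl, which later steps rely on.
for cmd in gitlab-runner systemctl; do
  if have_cmd "$cmd"; then
    echo "ok: $cmd found at $(command -v "$cmd")"
  else
    echo "missing: $cmd"
  fi
done
```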

3. Configure GitLab Runner

You can configure GitLab Runner either by editing the file directly on the server or by uploading a pre-configured file:

Option A: Edit configuration directly on server

# Edit configuration file directly
nano /etc/gitlab-runner/config.toml
# Copy contents from config.toml in this directory

Option B: Upload configuration file using SCP

# From your local machine, upload the config file
scp config.toml root@<MANAGER_IP>:/etc/gitlab-runner/config.toml

# Or if you have a customized version locally:
scp path/to/your/config.toml root@<MANAGER_IP>:/etc/gitlab-runner/config.toml

# SSH into the manager to verify the upload
ssh root@<MANAGER_IP>
cat /etc/gitlab-runner/config.toml # Verify content is correct
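
Instead of comparing the uploaded file by eye, the local and remote copies can be compared by checksum. The sketch below assumes GNU coreutils (sha256sum) is available; the helper name and the /tmp path are hypothetical.

```shell
# Hypothetical helper: succeed when two files have identical SHA-256 digests.
same_sha256() {
  a=$(sha256sum "$1" | awk '{print $1}')
  b=$(sha256sum "$2" | awk '{print $1}')
  [ -n "$a" ] && [ "$a" = "$b" ]
}

# Usage sketch (run from your local machine after fetching the remote copy):
#   scp root@<MANAGER_IP>:/etc/gitlab-runner/config.toml /tmp/config.remote.toml
#   same_sha256 config.toml /tmp/config.remote.toml && echo "upload verified"
```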

4. Install Fleeting Plugin and Start Service

# Install fleeting plugin (requires config.toml to be configured)
sudo gitlab-runner fleeting install

# Start and verify the service
sudo gitlab-runner start
sudo gitlab-runner status # Should show "running"

The config.toml file contains the main configuration for the GitLab Runner, including:

  • Runner registration token
  • Docker executor settings
  • Cache configuration
  • Autoscaling policies
  • Performance optimizations
  • Fallback server types for high availability
  • Enhanced cloud-init configuration for reliable Docker setup

Performance Optimizations

Our configuration includes several optimizations:

  1. Docker Settings

environment = [
  "DOCKER_DRIVER=overlay2",
  "DOCKER_BUILDKIT=1",
  "FF_USE_FASTZIP=true",
  "ARTIFACT_COMPRESSION_LEVEL=fast",
  "CACHE_COMPRESSION_LEVEL=fast",
  "TRANSFER_METER_FREQUENCY=5s",
  "FF_SCRIPT_SECTIONS=true",
  # Modern performance feature flags (tested and verified)
  "FF_NETWORK_PER_BUILD=true",
  "FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY=false",
  "FF_RESOLVE_FULL_TLS_CHAIN=false",
  "FF_DISABLE_UMASK_FOR_DOCKER_EXECUTOR=true",
  "FF_ENABLE_JOB_CLEANUP=false",
]

# Docker pull policy to use local images when available
[runners.docker]
  pull_policy = "if-not-present"

# Optimized autoscaling response times
[runners.autoscaler]
  update_interval = "30s" # Reduced from 1m for faster scaling
  update_interval_when_expecting = "2s" # Reduced from 5s for quicker response

# Enhanced capacity for parallel job processing
concurrent = 16 # Total concurrent jobs (top-level setting in config.toml)
capacity_per_instance = 8 # Jobs per instance (set under [runners.autoscaler])

  2. Autoscaling Policy

[[runners.autoscaler.policy]]
  periods = ["* 9-23 * * 1-5"] # Business hours (09:00-23:59 UTC, weekdays)
  timezone = "UTC"
  idle_count = 1
  idle_time = "30m"

  3. Fallback Server Types (v1.1.1+ feature)

# Fallback server types in order of preference
server_type = ["ccx53", "ccx43", "ccx33", "ccx23", "cpx51"] # High availability with multiple fallback options
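
To reason about when the business-hours policy is active, the period "* 9-23 * * 1-5" can be evaluated by hand: the hour field 9-23 matches every minute from 09:00 to 23:59, and the day-of-week field 1-5 matches Monday to Friday. The following sketch checks a given UTC hour and weekday against that period; the helper name is hypothetical and it only models this one expression, not cron syntax in general.

```shell
# Hypothetical helper: succeed if the given UTC hour (0-23) and day of week
# (0=Sunday ... 6=Saturday) fall inside the "* 9-23 * * 1-5" period.
in_active_period() {
  hour=$1
  dow=$2
  [ "$hour" -ge 9 ] && [ "$hour" -le 23 ] && [ "$dow" -ge 1 ] && [ "$dow" -le 5 ]
}

# Usage: evaluate the current UTC time.
now_hour=$(date -u +%H)   # zero-padded hour, e.g. "09"
now_dow=$(date -u +%w)    # 0=Sunday ... 6=Saturday
if in_active_period "${now_hour#0}" "$now_dow"; then
  echo "inside the business-hours policy (idle_count = 1, idle_time = 30m)"
else
  echo "outside the business-hours policy"
fi
```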

Performance Optimization Guide

Proven Performance Optimizations

These optimizations have been tested and verified to provide significant performance improvements:

1. Modern GitLab Feature Flags (expected 30-50% speed boost)

environment = [
# ... existing flags
"FF_NETWORK_PER_BUILD=true", # Better network isolation per job
"FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY=false", # Use modern execution strategy
"FF_RESOLVE_FULL_TLS_CHAIN=false", # Skip unnecessary TLS verification
"FF_DISABLE_UMASK_FOR_DOCKER_EXECUTOR=true", # Optimize file permissions
"FF_ENABLE_JOB_CLEANUP=false", # Disable verbose cleanup logging
]

2. Enhanced Capacity

concurrent = 16              # Total concurrent jobs across all instances
capacity_per_instance = 8 # Jobs per instance (balance with server specs)
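
The relationship between the two values can be checked with a ceiling division: with concurrent = 16 and capacity_per_instance = 8, two fully loaded instances are enough to reach the global job limit. A minimal sketch, with a hypothetical helper name:

```shell
# Hypothetical helper: instances needed to run a number of jobs at a given
# capacity per instance, using integer ceiling division.
instances_needed() {
  jobs=$1
  capacity=$2
  echo $(( (jobs + capacity - 1) / capacity ))
}

instances_needed 16 8   # values from the configuration above, prints 2
```

If capacity_per_instance is lowered for smaller fallback server types, the same calculation shows how many more instances the autoscaler has to create to serve the same load.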

3. Faster Autoscaling

update_interval = "30s"              # Check for scaling needs every 30s
update_interval_when_expecting = "2s" # Quick response when jobs are waiting

4. Docker Pull Policy Optimization

[runners.docker]
pull_policy = "if-not-present" # Use cached images when available

5. Multiple Server Type Fallbacks

server_type = ["ccx53", "ccx43", "ccx33", "ccx23", "cpx51"]  # Multiple fallback options for better availability

Expected Performance Gains

  • Enhanced Job Capacity: Handle significantly more parallel workload
  • 30-50% Faster Builds (expected): Modern feature flags optimize artifact/cache handling
  • Improved Scaling Response: Quicker response to job queue changes
  • Better Availability: Multiple server types reduce capacity issues
  • Faster Container Startup: Pull policy avoids unnecessary image downloads

Optimizations to Avoid

Based on testing, these optimizations cause issues and should be avoided:

Docker Image Pre-warming in Cloud-Init

# DON'T DO THIS - causes cloud-init timeouts
user_data = """
runcmd:
- docker pull node:18-alpine # Causes ready check failures
- docker pull nginx:alpine # Takes too long, fails cloud-init
"""

Complex Docker Daemon Configuration

# DON'T DO THIS - can cause startup failures
user_data = """
runcmd:
- echo '{"storage-driver":"overlay2"}' > /etc/docker/daemon.json # Risky
"""

Too Aggressive Scaling

# DON'T DO THIS - can overwhelm the system
update_interval = "5s" # Too frequent, causes instability
capacity_per_instance = 16 # Too high for most server types

Performance Testing Methodology

  1. Baseline Measurement: Record current pipeline execution times
  2. Incremental Changes: Apply one optimization at a time
  3. Monitor Stability: Watch for "ready up preparation failed" errors
  4. Measure Impact: Compare before/after pipeline times
  5. Document Results: Keep notes on what works for your workload
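
For the "Measure Impact" step, the before/after comparison can be reduced to a single percentage. The sketch below uses made-up durations in seconds; the helper name is hypothetical.

```shell
# Hypothetical helper: percentage reduction in duration relative to a baseline.
# Arguments are durations in seconds: baseline first, then the new measurement.
pct_faster() {
  awk -v before="$1" -v after="$2" 'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}

# Example with made-up numbers: a 600 s pipeline that now takes 400 s.
pct_faster 600 400   # prints 33.3
```

Averaging several pipeline runs before and after each change gives a more reliable number than a single run.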

Maintenance

Manager Instance Maintenance

Regular maintenance tasks:

# Monitor system resources
btop

# Update system packages
apt-get update && apt-get upgrade -y

# Check disk usage
df -h

# Monitor Docker
systemctl status docker

# View GitLab Runner logs
sudo gitlab-runner status
sudo journalctl -u gitlab-runner

Update GitLab Runner

# Download latest version
curl -L --output /usr/local/bin/gitlab-runner.new https://gitlab-runner-downloads.s3.amazonaws.com/latest/binaries/gitlab-runner-linux-amd64
chmod +x /usr/local/bin/gitlab-runner.new

# Check version
/usr/local/bin/gitlab-runner.new --version

# Apply update
systemctl stop gitlab-runner
mv /usr/local/bin/gitlab-runner /usr/local/bin/gitlab-runner.old
mv /usr/local/bin/gitlab-runner.new /usr/local/bin/gitlab-runner
systemctl start gitlab-runner
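
Because the update keeps the previous binary as gitlab-runner.old, a rollback is the same move in reverse. This is a sketch under that assumption; the helper name and the .failed suffix are hypothetical.

```shell
# Hypothetical helper: restore the previous binary saved as <path>.old.
# Mirrors the "Apply update" step in reverse; stop the service before calling it.
rollback_binary() {
  current=$1
  if [ ! -f "$current.old" ]; then
    echo "no backup found at $current.old" >&2
    return 1
  fi
  # Keep the failed version for inspection, then put the old one back.
  mv "$current" "$current.failed" && mv "$current.old" "$current"
}

# Usage sketch on the manager:
#   systemctl stop gitlab-runner
#   rollback_binary /usr/local/bin/gitlab-runner
#   systemctl start gitlab-runner
```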

Update Configuration

To update the GitLab Runner configuration:

Method 1: Upload new configuration file

# From your local machine, upload the updated config
scp config.toml root@<MANAGER_IP>:/etc/gitlab-runner/config.toml

# Restart GitLab Runner to apply changes
ssh root@<MANAGER_IP> "systemctl restart gitlab-runner"

# Verify the service is running
ssh root@<MANAGER_IP> "systemctl status gitlab-runner"

Method 2: Edit configuration directly

# SSH into the manager
ssh root@<MANAGER_IP>

# Backup current configuration
cp /etc/gitlab-runner/config.toml /etc/gitlab-runner/config.toml.backup

# Edit configuration
nano /etc/gitlab-runner/config.toml

# Restart service
systemctl restart gitlab-runner
systemctl status gitlab-runner

Regular Maintenance Schedule

Weekly Tasks:

  • Monitor dashboard for performance anomalies
  • Check cost optimization metrics
  • Review failed job patterns
  • Verify autoscaling behavior

Monthly Tasks:

  • Update GitLab Runner version
  • Review and optimize configuration
  • Analyze cost vs. performance metrics
  • Update documentation if needed

Quarterly Tasks:

  • Review instance types and pricing
  • Evaluate performance optimizations
  • Plan capacity for growth
  • Security audit and updates

Update fleeting plugin

  1. Check latest version at https://gitlab.com/hetznercloud/fleeting-plugin-hetzner/-/releases

  2. Set specific version in config.toml:

    [runners.autoscaler]
    plugin = "hetznercloud/fleeting-plugin-hetzner:1.1.1"
  3. Run sudo gitlab-runner fleeting list. Before the new version is installed, the output looks similar to:

    Runtime platform                                    arch=amd64 os=linux pid=1729152 revision=4d7093e1 version=18.0.2
    runner: t2_yfVi36, plugin: hetznercloud/fleeting-plugin-hetzner:1.1.1, error: plugin not found: /root/.config/fleeting/plugins/registry.gitlab.com/hetznercloud/fleeting-plugin-hetzner
  4. Run sudo gitlab-runner fleeting install. The output looks similar to:

    Runtime platform                                    arch=amd64 os=linux pid=1729160 revision=4d7093e1 version=18.0.2
    runner: t2_yfVi36, plugin: hetznercloud/fleeting-plugin-hetzner:1.1.1, path: /root/.config/fleeting/plugins/registry.gitlab.com/hetznercloud/fleeting-plugin-hetzner/1.1.1/plugin
  5. Done
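
To confirm which plugin version is pinned before running the install, the version can be read from the plugin line in config.toml. A minimal sketch, assuming the name:version format shown in step 2; the helper name is hypothetical.

```shell
# Hypothetical helper: print the version pinned in a plugin = "name:version" line.
# Prints nothing if the plugin line has no ":version" suffix.
pinned_plugin_version() {
  sed -n 's/^[[:space:]]*plugin[[:space:]]*=[[:space:]]*"[^":]*:\([^"]*\)".*/\1/p' "$1"
}

# Usage on the manager:
#   pinned_plugin_version /etc/gitlab-runner/config.toml
```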

Monitoring

Monitor our runners through:

  1. GitLab UI, on the group's Runners page (https://gitlab.com/groups/publicala/-/runners)
  2. Hetzner Cloud Console
  3. Grafana Dashboard

Key Performance Indicators

Performance Metrics:

  • Average job execution time: < 10 minutes
  • Cache hit rate: > 80%
  • Instance ready time: < 2 minutes
  • Runner saturation: < 90%

Cost Optimization:

  • Idle instance time: < 30 minutes
  • Resource utilization: > 70%
  • Autoscaling efficiency: > 85%

Reliability Metrics:

  • Job failure rate: < 5%
  • Instance creation success: > 95%
  • Service uptime: > 99.5%
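
As an illustration of how one of these targets can be checked, the sketch below computes a job failure rate from two counts and compares it with the 5% target. The counts are made up and the helper names are hypothetical; in practice the numbers would come from the dashboard or from GitLab.

```shell
# Hypothetical helper: job failure rate as a percentage with one decimal.
failure_rate() {
  awk -v failed="$1" -v total="$2" 'BEGIN {
    if (total == 0) { print "0.0" } else { printf "%.1f\n", failed / total * 100 }
  }'
}

# Hypothetical helper: succeed when the rate is below the 5% target listed above.
failure_rate_ok() {
  awk -v rate="$(failure_rate "$1" "$2")" 'BEGIN { exit !(rate < 5) }'
}

# Example with made-up counts: 3 failed jobs out of 120.
failure_rate 3 120   # prints 2.5
failure_rate_ok 3 120 && echo "within target"
```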

Monitoring Setup

Our monitoring stack consists of Prometheus for metrics collection and Grafana for visualization.

  1. Prometheus Setup on Manager Instance

# Install Prometheus
sudo apt install prometheus

# Configure Prometheus
sudo nano /etc/prometheus/prometheus.yml
# Copy contents from prometheus.yml in this directory

# Restart Prometheus service
sudo systemctl restart prometheus

# /etc/prometheus/prometheus.yml
# Sample config for Prometheus.

global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'example'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    scrape_timeout: 5s

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ['localhost:9090']

  - job_name: node
    # If prometheus-node-exporter is installed, grab stats about the local
    # machine by default.
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'gitlab-runner'
    static_configs:
      - targets: ['0.0.0.0:9252']

  2. Grafana Configuration
  • Our dashboard "Hetzner GitLab runners fleet 01" provides detailed metrics including:
    • Runner version and status
    • Job execution metrics (running, failed, duration)
    • Runner saturation
    • Error rates
    • Autoscaling metrics
    • Instance lifecycle timings
    • GitLab API request statistics

The dashboard visualizes key metrics that help us:

  • Monitor runner performance
  • Track job execution times
  • Identify bottlenecks
  • Manage resource utilization
  • Debug issues
  • Plan capacity

The complete dashboard configuration is available in grafana-dashboard.json.

Metrics Collection

Our runners are configured to expose Prometheus metrics through the following settings in config.toml:

listen_address = "0.0.0.0:9252"

[runners.prometheus]
enabled = true
listen_address = ":9252"

These metrics are then:

  1. Collected by Prometheus running on the manager instance
  2. Stored in Prometheus's time-series database
  3. Visualized in our Grafana dashboard
  4. Used for monitoring and alerting
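
The metrics endpoint can also be inspected directly on the manager, without going through Prometheus. The sketch below sums all samples of one metric from the plain-text exposition format. The helper name is hypothetical, the metric name in the usage comment is an assumption, and the helper expects samples without trailing timestamps.

```shell
# Hypothetical helper: sum every sample of one metric from Prometheus text
# format read on stdin (label sets are collapsed; comment lines are skipped).
metric_sum() {
  awk -v name="$1" '
    /^#/ { next }
    {
      metric = $1
      sub(/\{.*/, "", metric)   # drop the {label="..."} part, if any
      if (metric == name) { total += $NF }
    }
    END { print total + 0 }
  '
}

# Usage sketch on the manager. The metric name is an assumption; list the names
# actually exposed with: curl -s http://localhost:9252/metrics | grep '^# TYPE'
#   curl -s http://localhost:9252/metrics | metric_sum gitlab_runner_jobs
```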

Troubleshooting

Common Issues

  1. Runner Not Starting

    • Check the GitLab Runner service status
    • Verify the configuration file syntax
    • Check the runner logs
  2. Jobs Stuck in Pending

    • Verify runner registration token
    • Check autoscaling configuration
    • Monitor runner capacity
  3. Slow Job Execution

    • Check cache configuration
    • Monitor system resources
    • Verify network connectivity
  4. Cache Issues

    • Verify S3 credentials
    • Check bucket permissions
    • Monitor cache hit rates
  5. Docker Connection Issues

    • Verify cloud-init completed successfully
    • Check Docker service status on runner instances
    • Review enhanced cloud-init configuration in config.toml
    • Ensure GitLab Runner and fleeting plugin versions are compatible
  6. Resource Unavailability Issues

    • Symptoms: Repeated "resource_unavailable" errors in logs for all server types
    • Root Cause: Hetzner datacenter capacity limitations
    • Solutions:
      • Change datacenter location in config.toml: location = "nbg1" or location = "hel1" or location = "fsn1" or location = "ash"
      • Try smaller server types first: server_type = ["cpx31", "cpx21", "cpx11", "cx32", "cx42", "cx52", "cpx41", "ccx53", "ccx43", "ccx33", "ccx23", "cpx51"]
      • Monitor Hetzner status page (https://status.hetzner.com/) for capacity issues
      • Consider using mixed locations for better availability
      • During severe capacity shortages, creating instances via the Hetzner Cloud Console may work even when the API fails
      • Be patient - the autoscaler will keep retrying and may eventually succeed
  7. Cloud-Init Ready Check Failures

    • Symptoms: "ready up preparation failed" with exit code 1, instances continuously created and destroyed
    • Root Cause: Complex cloud-init scripts with Docker image pre-warming or daemon configuration
    • Solutions:
      • Keep cloud-init simple and focused on essential packages only
      • Avoid pre-pulling Docker images in cloud-init (causes timeouts)
      • Avoid complex Docker daemon configuration in user_data
      • Test cloud-init changes in isolation before applying to production
    • Working cloud-init example: Basic Docker installation with standard packages only
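
To tell how often the symptoms from issues 6 and 7 occur, the runner logs can be searched for the two error strings quoted above. A minimal sketch; the helper name is hypothetical and the one-hour window is only an example.

```shell
# Hypothetical helper: count log lines on stdin that contain a fixed string.
# "|| true" keeps the exit status at 0 when grep finds no match and prints 0.
count_matches() {
  grep -cF -- "$1" || true
}

# Usage sketch on the manager, looking at the last hour of runner logs:
#   journalctl -u gitlab-runner --since "1 hour ago" | count_matches "ready up preparation failed"
#   journalctl -u gitlab-runner --since "1 hour ago" | count_matches "resource_unavailable"
```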

Cost Optimization

Current Costs vs. Benefits

Monthly Costs (~60 USD):

  • Manager instance (cpx11): ~3 EUR/month
  • Runner instances (ccx53): Variable based on usage
  • Object storage: ~5 EUR/month
  • Network transfer: Minimal

Cost Savings:

  • Eliminated GitLab managed runner costs (60+ USD/month)
  • Better performance reduces overall pipeline time
  • Shared cache reduces redundant downloads

Performance Benefits:

  • 2-4x faster pipeline execution
  • Dedicated CPU cores (no noisy neighbors)
  • Optimized Docker layer caching
  • Predictable performance characteristics

Monitor costs through:

  1. Hetzner billing dashboard
  2. Object store usage metrics
  3. Runner utilization stats
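
As a reference point for these comparisons, the managed-runner overage mentioned in the introduction (~60 USD for 7k extra minutes) works out to a little under one cent per minute. The sketch below shows the arithmetic; the helper name is hypothetical.

```shell
# Hypothetical helper: cost per CI minute, printed with four decimals.
cost_per_minute() {
  awk -v usd="$1" -v minutes="$2" 'BEGIN { printf "%.4f\n", usd / minutes }'
}

# Managed-runner overage from the introduction: ~60 USD for 7000 extra minutes.
cost_per_minute 60 7000   # prints 0.0086
```

The same calculation can be applied to the monthly Hetzner bill divided by the minutes the fleet actually ran.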

Security Considerations

Infrastructure Security

Network Security:

  • Runner instances in public network with minimal attack surface
  • Manager instance with restricted firewall rules
  • Secure token management for API access

Access Control:

  • Limited SSH key access stored in 1Password
  • API tokens with minimal required permissions
  • Regular security updates and patches

Data Protection:

  • Encrypted storage for sensitive cache data
  • Secure transmission of artifacts and logs
  • Isolated execution environments per job

Emergency Procedures

Service Outage:

  1. Check GitLab Runner status
  2. Verify Hetzner cloud status
  3. Switch to GitLab managed runners temporarily
  4. Investigate and resolve root cause

Security Incident:

  1. Immediately revoke compromised tokens
  2. Scale down all instances
  3. Audit access logs
  4. Implement additional security measures

Maintenance Responsibility

As we manage our own runner infrastructure, our team is responsible for its entire lifecycle. This includes:

  1. System Updates

    • Regular OS updates
    • GitLab Runner version updates
    • Docker and dependencies maintenance
  2. Performance Monitoring

    • Resource utilization tracking
    • Pipeline execution times
    • Cache hit rates
    • Network performance
  3. Security Management

    • Access control
    • Network security
    • Vulnerability patching
    • Certificate management
  4. Cost Control

    • Resource optimization
    • Instance scaling
    • Storage utilization
    • Network transfer costs
  5. Incident Response

    • System outages
    • Performance degradation
    • Security incidents
    • Pipeline failures

Maintenance Log

Thursday 2025.07.24

Responsible: Franco Gilio and Claude Code

Reason:

  • GitLab CI jobs showing excessive cleanup log messages ("Removing..." entries)
  • Log pollution making it difficult to see actual job output

Actions:

  • Updated FF_ENABLE_JOB_CLEANUP: Changed from true to false to disable verbose cleanup logging
  • Updated documentation: Changed all references in runners_setup.md
  • Restarted GitLab runner service: Applied configuration change to production

Results:

  • Eliminated verbose cleanup messages from CI job logs
  • Improved log readability while maintaining cleanup functionality
  • Note: Cleanup still occurs, only verbose logging is disabled

Server(s): hetzner-gitlab-runners-fleet-01-manager (188.245.254.129)


Monday 2025.07.21

Responsible: Franco Gilio and Claude Code

Reason:

  • Restore high-performance server types after Hetzner capacity issues resolved

Actions:

  • Updated server types: Changed from temporary low-power instances back to dedicated CPU instances: ["ccx53", "ccx43", "ccx33", "ccx23", "cpx51", "cpx31", "cpx21"]
  • Restarted GitLab runner service

Results:

  • Restored optimal CI/CD performance with dedicated CPU instances
  • Kept emergency fallbacks for future capacity issues

Server(s): hetzner-gitlab-runners-fleet-01-manager (188.245.254.129)


Thursday 2025.07.10

Responsible: Franco Gilio and Claude Code

Reason:

  • GitLab runners not creating instances due to Hetzner capacity crisis
  • CI jobs stuck at "Preparing the docker-autoscaler executor"

Actions:

  • Diagnosed widespread capacity issue: Hetzner experiencing severe capacity shortage across all datacenters (nbg1, fsn1, hel1, ash)
  • Updated server type priorities: Added smaller server types that have better availability: ["cpx31", "cpx21", "cpx11", "cx32", "cx42", "cx52", "cpx41"]
  • Attempted multiple datacenters: Tested nbg1, fsn1, hel1, and ash locations
  • Identified Hetzner status page issue: Limited cloud plan availability affecting CX and CAX plans

Results:

  • Eventually succeeded in creating a CPX21 instance after multiple retries
  • Jobs resumed processing after ~10 minutes of unavailability
  • Confirmed that smaller server types have better availability during capacity crises
  • Updated troubleshooting documentation with new insights

Server(s): hetzner-gitlab-runners-fleet-01-manager (188.245.254.129)


Sunday 2025.07.06

Responsible: Franco Gilio and Claude Code

Reason:

  • Performance optimization for GitLab runners
  • Troubleshooting resource unavailability issues

Actions:

  • Added alternative server types: Extended fallbacks from ["ccx53", "ccx43"] to ["ccx53", "ccx43", "ccx33", "ccx23", "cpx51"] for better availability
  • Switched datacenter location: Changed from fsn1 to nbg1 to avoid capacity constraints
  • Added Docker pull policy optimization: Set pull_policy = "if-not-present" to use local cached images
  • Implemented modern GitLab feature flags: Added 5 performance flags:
    • FF_NETWORK_PER_BUILD=true
    • FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY=false
    • FF_RESOLVE_FULL_TLS_CHAIN=false
    • FF_DISABLE_UMASK_FOR_DOCKER_EXECUTOR=true
    • FF_ENABLE_JOB_CLEANUP=false
  • Enhanced runner capacity: Increased concurrent jobs to 16 and capacity per instance to 8
  • Optimized autoscaling response: Reduced update intervals to 30s (from 1m) and 2s when expecting (from 5s)
  • Fixed cloud-init issues: Removed Docker image pre-warming and complex daemon configuration that caused instance ready check failures
  • Added S3 cache optimizations: Added Insecure = false and BucketLocation = "eu-central-1"

Results:

  • Enhanced job capacity and parallel processing capability
  • Improved scaling response times
  • Better server availability through multiple fallbacks
  • Resolved instance creation failures
  • 30-50% expected performance improvement for pipeline execution

Server(s): hetzner-gitlab-runners-fleet-01-manager (188.245.254.129)


Wednesday 2025.05.28

Responsible: Franco Gilio

Reason:

  • Enhance Docker setup reliability
  • Introduce fallback server types for improved availability

Actions:

  • Added fallback server types: Introduced ccx43 as fallback to primary ccx53 instances
  • Enhanced cloud-init configuration: Improved Docker setup reliability on runner instances
  • Updated documentation: Enhanced runners setup guide with fallback server information
  • Optimized business hours policy: Adjusted autoscaling periods to 9AM-11PM UTC weekdays

Results:

  • Improved runner availability through server type fallbacks
  • More reliable Docker setup on new instances
  • Better documentation coverage

Server(s): hetzner-gitlab-runners-fleet-01-manager


Sunday 2025.03.09

Responsible: Franco Gilio

Reason:

  • Migrate to a new Hetzner region because of an incident in the original region

Actions:

  • Region migration: Updated configuration from previous region to fsn1 (Falkenstein)
  • Updated documentation: Reflected new region in setup guide

Results:

  • Restored availability in the new region

Server(s): All GitLab runner instances


Friday 2025.01.24 - Initial Setup

Responsible: Franco Gilio

Reason:

  • Initial setup of self-hosted GitLab runners fleet
  • Cost optimization and performance improvement over GitLab managed runners

Actions:

  • Created GitLab runner configuration: Comprehensive config.toml with autoscaling and Docker executor
  • Setup documentation: Created detailed setup guide covering architecture, prerequisites, and maintenance
  • Monitoring integration: Added Prometheus metrics collection and Grafana dashboard
  • Cost analysis: Documented cost benefits vs GitLab managed runners

Results:

  • Established self-hosted GitLab runners infrastructure
  • Achieved cost savings compared to managed runners
  • Improved performance with dedicated resources
  • Comprehensive monitoring and documentation

Server(s): hetzner-gitlab-runners-fleet-01-manager (initial deployment)

